Learning outcomes

After this lecture, you will be able to:

  1. Understand and explain measures of central tendency (mean, median, mode) and spread (range, IQR, variance, standard deviation)
  2. Calculate statistical measures using both R and Excel
  3. Choose appropriate statistical measures for your data
  4. Interpret statistical results in a biological context
  5. Begin troubleshooting common programming issues

Getting Started Checklist

Installation Support

If you have any trouble with installation:

  1. Check our troubleshooting guide
  2. Ask on the unit discussion board
  3. Visit help sessions
  4. Email the teaching team

Essential Setup

Statistical computing: A quick history

From calculators to computers

1800s: Mechanical calculators.

1960s: Statistical software BMDP and SPSS (not in image).

Statistical software in the 1970s

1970s: SAS (Statistical Analysis System).

1976: Birth of S at Bell Labs. S-PLUS debuts in 1988.

The R story

  • Created at the University of Auckland, New Zealand, in 1993
  • Named after its creators (Ross Ihaka and Robert Gentleman) and inspired by the S programming language
  • Developed rapidly in the 2000s
  • Designed specifically for statistical computing and graphics, but now used in many fields

The R graphical user interface.

R in today’s world

  • Leading tool in data science and statistics (although Python leads in the majority of machine learning workflows)
  • Over 22,000 packages on CRAN – extensive statistical capabilities
  • Integration with other modern tools: Python, HTML, JavaScript, Excel, AJAX…
  • Meets modern academic standards of reproducibility and increasingly preferred by statisticians

RStudio IDE. Source: Januar Harianto

Why statistical programming?

Statistical programming combines statistics and computer code to:

  1. Analyse data quickly and accurately – especially large datasets
  2. Share methods and results with others
  3. Automate complex calculations and visualise results clearly

It’s like having a powerful calculator that can help us tell stories about our data in a repeatable way.

Key statistical concepts

Population vs Sample


Population

  • All possible observations
  • Usually too large to measure
  • Example: All trees in a forest

Sample

  • Subset of the population
  • What we actually measure
  • Example: 100 trees we measured

Most (if not all) statistical analyses are based on samples, not populations.

Sampling in statistics

How well does a sample represent the population?

Some thoughts:

  • Sample size: Larger samples are more likely to represent the population
  • Sampling method: Random samples are more likely to be representative
  • Population variability: More variability means larger samples are needed
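The effect of sample size can be sketched with a small simulation. This is only an illustration under stated assumptions: the population is a hypothetical set of 1000 tree heights (normal, mean 20, sd 5), and the names tree_heights and sample_means are made up for this example.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible
tree_heights <- rnorm(1000, mean = 20, sd = 5)  # hypothetical population

# Draw `reps` random samples of size n and return the mean of each sample
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(tree_heights, size = n)))
}

means_small <- sample_means(6)    # many samples of 6 trees
means_large <- sample_means(100)  # many samples of 100 trees

# How much do the sample means vary from one sample to the next?
sd(means_small)
sd(means_large)
```

The means of the larger samples vary much less from sample to sample, so any one large sample is more likely to land close to the population mean.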

Samples will vary

Different samples give different results. Suppose we have a population of 1000 trees and we randomly sample 6 tree heights. If we repeat this 3 times, the three samples will almost certainly differ:

set.seed(258) 
population <- rnorm(1000, mean = 20, sd = 5)

# create samples
sample1 <- sample(population, size = 6)
sample2 <- sample(population, size = 6)
sample3 <- sample(population, size = 6)
# show samples
for (i in 1:3) {
   cat(sprintf("Sample %d: ", i), get(paste0("sample", i)), "\n")
}
Sample 1:  21.66633 22.61768 22.79266 17.64633 14.50462 17.9679 
Sample 2:  15.60759 14.0909 17.89364 18.10461 20.48023 22.88689 
Sample 3:  17.75913 15.89302 26.43149 27.08996 14.99993 34.06894 

So how do we make sense of these samples?

Descriptive statistics

We can describe our samples using:

  1. Measures of central tendency – describe the “typical” value
    • mean, median, mode
  2. Measures of spread – describe how much the data varies
    • standard deviation, variance (commonly used)
    • range, quartiles, IQR (useful for skewed data or data with outliers)
  3. Measures of uncertainty – describe how confident we are in our estimates
    • standard error
    • confidence intervals

Measures of central tendency

mean | median | mode

Mean – also known as the average

The mean is what most people call the “average”:

  • Add up all your numbers
  • Divide by how many numbers you have

Mathematical notation

  • Population mean: \mu = \frac{\sum_{i=1}^{N} x_i}{N}
  • Sample mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

Where x_i is each individual value, N is population size, and n is sample size.

Mean in R

We can save a group of numbers in a vector called scores in R:

# Our test scores
scores <- c(80, 85, 90, 95)

Manual calculations:

# manual calculation
(80 + 85 + 90 + 95) / 4
[1] 87.5
# Alternative way
sum(scores) / length(scores)
[1] 87.5

We can use the mean() function:

mean(scores)
[1] 87.5

Mean in Excel

Excel offers several ways to calculate the mean:

  1. Using AVERAGE function

    =AVERAGE(A1:A4)
    • Type =AVERAGE(
    • Select cells with your data
    • Press Enter
  2. Using AutoCalculate

    • Select your data cells
    • Look at bottom right
    • Average shown automatically

Median – the middle value

The median is the middle number when your data is in order:

  1. First, put your numbers in order
  2. Find the middle value
  3. If you have an even number of values, take the average of the two middle numbers

Example: House prices ($’000s): 450, 1100, 480, 460, 470, 420, 1400, 450, 470

Order: 420, 450, 450, 460, 470, 470, 480, 1100, 1400
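The three steps can be sketched by hand in R using the same house prices. The variable names here (sorted_prices, manual_median) are introduced only for this illustration.

```r
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

sorted_prices <- sort(prices)  # step 1: put the numbers in order
n <- length(sorted_prices)

if (n %% 2 == 1) {
  # odd number of values: take the single middle value
  manual_median <- sorted_prices[(n + 1) / 2]
} else {
  # even number of values: average the two middle values
  manual_median <- mean(sorted_prices[c(n / 2, n / 2 + 1)])
}

manual_median
```

With 9 values, the middle one is the 5th value in the sorted list, which is 470.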

How is it useful?

Median in R

R does all the ordering and finding the middle for us:

# House prices
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

# Find median
median(prices)
[1] 470

Comparing the mean and median:

# Compare with mean
mean(prices)
[1] 633.3333

Which is a better measure for house prices?
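One way to explore this question is to check how each measure reacts when the two most expensive houses are left out. A minimal sketch, reusing the prices above (the name prices_typical is introduced here for illustration):

```r
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

# Drop the two very expensive houses (1100 and 1400)
prices_typical <- prices[prices < 1000]

# How far does each measure move when those two houses are removed?
mean_shift <- mean(prices) - mean(prices_typical)
median_shift <- median(prices) - median(prices_typical)

mean_shift
median_shift
```

The mean moves far more than the median, which is why the median is often preferred for skewed data such as house prices.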

Median in Excel

Excel provides two main ways to find the median:

  1. Using MEDIAN function

    =MEDIAN(A1:A9)
    • Type =MEDIAN(
    • Select your data range
    • Press Enter
  2. Alternative method

    • Sort your data first (use the Sort functionality in the Data tab)
    • Find middle value(s)
    • If even number of values, average the middle two

Mode – most frequent value

The mode is the value that appears most frequently in your data. It’s particularly useful for:

  • Categorical data (like blood types, eye colors)
  • Finding the most common item in a group
  • Data that has clear repeated values

Calculating the mode can be tricky, especially if there are multiple modes or no mode at all. This is why the mode is not commonly used in statistics.

Questions that the mode can answer

  • What is the most common blood type in a population?
  • What is the most common eye color in a group of people?

Calculating the mode in R

Base R has no built-in function for the statistical mode (mode() returns an object's storage type instead), so we use the modeest package:

if(!require("modeest")) install.packages("modeest")
Loading required package: modeest
library(modeest)

df <- c(1, 2, 3, 3, 4, 5, 5, 5, 6)
mlv(df, method = "mfv")  # most frequent value
[1] 5

If you were to do it yourself, how would you do it in R?

Use the table() function to count frequencies:

freq_table <- table(df) # Count frequencies of each value
# Find which value(s) appear most often
modes <- as.numeric(names(freq_table[freq_table == max(freq_table)]))
modes
[1] 5

Use run-length encoding after sorting:

sorted_df <- sort(df) # Sort the vector first
runs <- rle(sorted_df) # Use run-length encoding to find sequences
modes <- runs$values[runs$lengths == max(runs$lengths)] # Find the value(s) with max length
modes
[1] 5

Group the vector by value and count occurrences with tapply():

unique_vals <- factor(df) # Create a factor of unique values
counts <- tapply(df, unique_vals, length) # Count occurrences using tapply
modes <- as.numeric(names(counts[counts == max(counts)])) # Find which values have the maximum count
modes
[1] 5

The point is that it doesn’t matter how you calculate the mode, as long as you are able to do it. Also – if you needed this – aren’t you glad R has a package for it?
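The table() approach also handles categorical data and ties. A minimal sketch, assuming made-up blood-type data; find_modes is a hypothetical helper written for this example, not a function from base R or modeest.

```r
# Hypothetical helper: return every value that shares the highest count
find_modes <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # keep all values tied for the maximum
}

# Made-up data for illustration
blood_types <- c("O", "A", "O", "B", "A", "O", "AB")
find_modes(blood_types)

# Two values tied for most frequent: both are returned
tied <- c("A", "A", "B", "B", "O")
find_modes(tied)
```

Because table() stores values as names, the result is always a character vector; wrap it in as.numeric() when the data are numbers.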

Mode in Excel

Excel provides several methods to find the mode:

1. MODE.SNGL function (single mode)

=MODE.SNGL(A1:A10)
  • Returns most frequent value
  • Returns #N/A if no repeats

2. MODE.MULT function (multiple modes)

=MODE.MULT(A1:A10)
  • Returns array of modes
  • Press Ctrl+Shift+Enter in older versions of Excel (Excel 365 spills the results automatically)

Measures of spread

A biological example

Source: Adobe Stock # 85659279

Imagine sampling seagrass blade lengths from two different sites in a marine ecosystem, and they have the same mean length of 15.2 cm. Are both sites the same?

  • Site A (Protected Bay): 15.2, 15.0, 15.3, 15.1, 15.2 centimetres
  • Site B (Wave-exposed Coast): 12.0, 18.0, 14.5, 16.5, 15.0 centimetres

Comparing Different Measures

# Plot seagrass lengths
library(ggplot2)
library(patchwork)
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Create plots for both sites
p1 <- ggplot() +
   geom_point(aes(x = 1:5, y = seagrass_protected), size = 3) +
   geom_hline(yintercept = mean(seagrass_protected), linetype = "dashed", color = "red") +
   labs(title = "Site A: Protected Bay", x = "Measurement", y = "Length (cm)") +
   ylim(10, 20)

p2 <- ggplot() +
   geom_point(aes(x = 1:5, y = seagrass_exposed), size = 3) +
   geom_hline(yintercept = mean(seagrass_exposed), linetype = "dashed", color = "red") +
   labs(title = "Site B: Wave-exposed Coast", x = "Measurement", y = "Length (cm)") +
   ylim(10, 20)

# Combine plots side by side
p1 + p2

Why do we need measures of spread?

  • Central tendency (mean, median, mode) only tells part of the story
  • Spread tells us how much variation exists in our data
  • Different measures of spread tell us different things:
    • Range: Overall spread of data
    • IQR: Spread of middle 50% of data
    • Variance: Average squared deviation from mean
    • Standard deviation: Typical deviation from the mean, in original units

Range – The simplest measure of spread

# Create our seagrass data
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)  # Protected bay
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)    # Wave-exposed coast

# Calculate ranges
cat("Protected bay range:", diff(range(seagrass_protected)), "cm\n")
Protected bay range: 0.3 cm
cat("Wave-exposed range:", diff(range(seagrass_exposed)), "cm\n")
Wave-exposed range: 6 cm

Note

The range shows us that seagrass lengths are much more variable in the wave-exposed site!

Interquartile range (IQR): The middle 50%

The IQR tells us how spread out the middle 50% of our data is:

# Get quartiles for protected bay
quantile(seagrass_protected)
  0%  25%  50%  75% 100% 
15.0 15.1 15.2 15.2 15.3 
  • 25% of data below Q1 (1st quartile)
  • 75% of data below Q3 (3rd quartile)
  • IQR = Q3 - Q1
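The relationship IQR = Q3 - Q1 can be checked directly. A minimal sketch using the wave-exposed data; the names q and iqr_manual are introduced only for this example.

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Ask quantile() for just Q1 (25%) and Q3 (75%)
q <- quantile(seagrass_exposed, probs = c(0.25, 0.75))

# IQR = Q3 - Q1 (unname() drops the "75%" label from the result)
iqr_manual <- unname(q[2] - q[1])

iqr_manual
IQR(seagrass_exposed)  # built-in function gives the same value
```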

Why use IQR?

  • Ignores extreme values
  • Works with skewed data
  • More stable than range
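To see why the IQR is described as more stable than the range, here is a small sketch that adds one extreme value to the wave-exposed data. The 30 cm blade is a hypothetical outlier invented for this illustration.

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)
with_outlier <- c(seagrass_exposed, 30)  # hypothetical extreme blade length

# How much does each measure change when the outlier is added?
range_change <- diff(range(with_outlier)) - diff(range(seagrass_exposed))
iqr_change <- IQR(with_outlier) - IQR(seagrass_exposed)

range_change
iqr_change
```

The range grows by 12 cm, while the IQR grows by only 1 cm.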

Comparing Sites Using IQR

# Compare IQRs
pbay <- IQR(seagrass_protected)
pbay
[1] 0.1
exbay <- IQR(seagrass_exposed)
exbay
[1] 2
  • Protected bay IQR: 0.1 cm
  • Wave-exposed IQR: 2 cm

Note

The larger IQR in the wave-exposed site shows more spread among typical seagrass lengths.

Variance: a detailed measure of spread

Variance measures how far data points are spread from their mean by:

  1. Finding how far each point is from the mean
  2. Squaring these distances (to handle negative values)
  3. Taking the average of these squared distances (for a sample, dividing by n - 1 rather than n)
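These steps can be sketched by hand in R using the wave-exposed data. Note that R's var() computes the sample variance, so the sum of squared distances is divided by n - 1 rather than n. The intermediate names (deviations, squared, var_manual) are introduced only for this example.

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Step 1: distance of each point from the mean
deviations <- seagrass_exposed - mean(seagrass_exposed)

# Step 2: square the distances so negatives do not cancel positives
squared <- deviations^2

# Step 3: average the squared distances (sample variance divides by n - 1)
var_manual <- sum(squared) / (length(seagrass_exposed) - 1)

var_manual
var(seagrass_exposed)  # built-in function gives the same value
```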

Why use variance?

  • Uses all data points (unlike IQR)
  • Less sensitive to outliers than range
  • Shows total spread in both directions

Key Points

  • Measured in squared units (cm²)
  • Larger variance = more spread

Calculating Variance in R

# Calculate variance for both sites
cat("Protected bay variance:", var(seagrass_protected), "cm²\n")
Protected bay variance: 0.013 cm²
cat("Wave-exposed variance:", var(seagrass_exposed), "cm²\n")
Wave-exposed variance: 5.075 cm²

Note

The larger variance in the wave-exposed site shows more spread from the mean!

Standard deviation: a more interpretable measure

Standard deviation (SD, or \sigma for population, s for sample) is the square root of variance:

  • Tells us the “typical distance” from the mean
  • Easy to understand - similar to saying “± value” after a mean
  • Small SD means values cluster closely around mean
  • Large SD means values are more spread out
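The link between variance and standard deviation is quick to confirm in R. A minimal sketch with the wave-exposed data (sd_from_var is a name introduced for this example):

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Standard deviation is the square root of the variance
sd_from_var <- sqrt(var(seagrass_exposed))

sd_from_var
sd(seagrass_exposed)  # built-in function gives the same value
```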

When and Why to Use It

  • Values are in the same units as your data (unlike variance)
  • Perfect for describing natural variation (height, weight, temperature)
  • Used in many statistical tests
  • Great for comparing different groups or datasets

Interpreting standard deviation (with R)

We can describe our seagrass lengths using mean ± standard deviation:

# Protected bay
mean_p <- mean(seagrass_protected)
sd_p <- sd(seagrass_protected)
cat("Protected bay:", round(mean_p, 1), "±", round(sd_p, 2), "cm\n")
Protected bay: 15.2 ± 0.11 cm
# Wave-exposed
mean_e <- mean(seagrass_exposed)
sd_e <- sd(seagrass_exposed)
cat("Wave-exposed:", round(mean_e, 1), "±", round(sd_e, 2), "cm\n")
Wave-exposed: 15.2 ± 2.25 cm

Tip

The ± tells us about the typical variation around the mean. Larger values indicate more spread!

Comparing Spread Measures

Measure | Protected Bay | Wave-exposed Coast | What it Tells Us
Range | 0.3 cm | 6 cm | Overall spread (sensitive to outliers)
IQR | 0.1 cm | 2 cm | Middle 50% spread (ignores extremes)
Variance | 0.013 cm² | 5.075 cm² | Average squared distance from mean
SD | 0.11 cm | 2.25 cm | Typical distance from mean (in original units)

Key Observations

  • Wave-exposed site shows consistently more variation
  • Each measure gives a different perspective
  • Choose based on your data and goals
  • Standard deviation is most commonly used in research papers

Basic Measures of Spread in Excel

Common Excel functions for measuring spread:

  1. Range: Use MAX() and MIN()

    =MAX(A1:A10) - MIN(A1:A10)
  2. Quartiles and IQR: Use QUARTILE.INC()

    For Q1: =QUARTILE.INC(A1:A10, 1)
    For Q3: =QUARTILE.INC(A1:A10, 3)
    For IQR: =QUARTILE.INC(A1:A10, 3) - QUARTILE.INC(A1:A10, 1)

Advanced Measures of Spread in Excel

Statistical functions for variance and standard deviation:

  1. Sample Variance: Use VAR.S()

    =VAR.S(A1:A10)
  2. Sample Standard Deviation: Use STDEV.S()

    =STDEV.S(A1:A10)

Tip

Use .P instead of .S for population measures:

  • VAR.P() for population variance
  • STDEV.P() for population standard deviation

References and Resources

Core Reading

  • Quinn & Keough (2024). Experimental Design and Data Analysis for Biologists. Cambridge University Press. Chapter 2: Things to know before proceeding.
  • Canvas site for lecture notes and additional resources

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.